PyMR
Introduction

PyMR is a Python 2.7.x library that implements a MapReduce algorithm. Its aim is to make it easy to prototype algorithms using MapReduce.

Download

The source code is available on GitHub.

Advice

This section gives some advice that can help you design Python code using PyMR.

PyMR and multi-threading

The main class of the PyMR library, MapReduce, takes an argument that sets the number of threads. In general, setting more than one thread will not really speed up the algorithm. In fact, execution usually slows down because the program must access the hard drive; the exception is when the map or reduce function is really time-consuming. See here for an example.

Warnings

This section warns you about some bad situations that can arise when using PyMR.

PyMR and Dropbox

PyMR is a direct implementation of the algorithm described in the reference book: Leskovec, Rajaraman, Ullman, Mining of Massive Datasets, available at http://infolab.stanford.edu/~ullman/mmds/book.pdf. As a consequence, a single execution of PyMR.mapreduce generates thousands of files. It is highly recommended not to run the code while Dropbox synchronization is enabled.

PyMR and big data

For now, PyMR works on big data, but because hard drive access is slow, an execution can take hours or days depending on your hard drive (SSD or not). The algorithm works on data of any size, but it is recommended to design a prototype of your algorithm with PyMR and run it on a reduced amount of data. Afterwards, you can write a faster program using, for instance, Hadoop. However, if you have time, you can still use PyMR on the full data.

Known Bugs

PyMR and multi-threading with self attributes in mapper/reducer

With the current implementation of PyMR, there is a bug when your mapper or reducer uses self attributes with two or more threads.
It comes from the fact that there is only one instance of the mapper and one instance of the reducer. When your map or reduce method modifies a self attribute, the execution can crash or produce wrong results. You can still initialise and/or read self attributes (for example, see the matrix-vector example).

License

This implementation is provided under the MIT License.

Standard execution of the MapReduce algorithm with PyMR

Step 1: Creating Chunks

First, MapReduce creates chunks of 50 MB. The class that creates the chunks is ChunksFactory. It simply reads all your input files and splits them into subfiles of 50 MB.

Step 2: Local Mapping and Grouping

Once the data are split into chunks, the algorithm maps the chunks one by one. The map is simple: an iterator is created (see class mapChunkIterator) and given to the context variable theContext. Once the context is instantiated with the iterator, it is passed to the user-defined map method. All key/value pairs produced by the map are stored in a list.

Once the map function has finished, a local grouper is applied to the list of key/value pairs. The goal is to put all values that share a key in the same file (a value file). One value file is created per key and per execution of a local grouper.

The grouper stores one big structure: a dictionary containing all key/node file pairs. The node file (always associated with one and only one key) is a file containing the names of all value files created since the grouper was instantiated (each time a grouper is called, it creates a new value file and updates the node file). A value file is a file containing some values associated with a single key.

The execution of the grouper is described here, step by step:

# Load the big key/node file dictionary (we will call it bigDict) from the save state file. Also create a new, temporary dictionary (we will call it tempDict) that will link each key to a list of values.
# Read all key/value pairs and put them in tempDict.
# Copy all node files from the previous execution into new node files (this way, a crash will not affect the old output).
# Check whether all keys in tempDict are referenced by bigDict. Create a new (empty) node file for each key not present in bigDict and add the new key/node file entry to bigDict.
# For each key, write the list tempDict[key] to a value file and append the value file name to the content of the associated node file.
# Write a new save state file which contains bigDict.

Step 3: Global Grouping

This step is very simple: the global grouping consists in reading all node files from all groupers and merging them into a single big node file.

Step 4: Reducing

This step is very similar to the mapping. We choose a key and open the associated node file. Then the user-defined reducer browses all value files and their contents (which are the values). At the end, the output of the reducer is put in the output dictionary, which contains all reduced key/value pairs.

Documentation

This section describes all components, classes and functions of the PyMR library.

ChunksFactory

The goal of this class is to divide a list of files into chunks of 50 MB.

__init__(self, files)
Input: A list of strings, which are file names. Each file referenced in the list must have one input per line (by input, we mean one entry for the mapper).
Output: Instantiation of ChunksFactory.

divideIntoChunks(self,filenameGenerator)
Input: A function which returns strings. These strings are used to create the chunks.
Output: Nothing.
Details: Creates chunks of about 50 MB.

fileHelper

appendListInFile(fileName,stackOfValues)
Inputs: A string and a list of values. The string must be the name of an existing file.
Output: Nothing.
Details: Appends the values in the list to the end of the file referenced by fileName.
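
As an illustration of the behaviour described for appendListInFile, here is a minimal sketch in plain Python. It is not PyMR's source code: the name append_list_in_file is hypothetical, and writing one value per line is an assumption.

```python
import os
import tempfile


def append_list_in_file(file_name, stack_of_values):
    """Append each value to the end of an existing file, one per line.

    Hypothetical stand-in for fileHelper.appendListInFile; the
    one-value-per-line layout is an assumption.
    """
    if not os.path.isfile(file_name):
        raise IOError("file does not exist: %s" % file_name)
    with open(file_name, "a") as handle:
        for value in stack_of_values:
            handle.write("%s\n" % value)


def demo():
    """Create a temporary file, append two batches, return the lines."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        append_list_in_file(path, ["a", "b"])
        append_list_in_file(path, [1, 2])
        with open(path) as handle:
            return handle.read().splitlines()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Opening the file in append mode keeps whatever was written by earlier calls, which is the property the grouper relies on when it adds value file names to a node file.
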

appendFileInFile(inFileName,outFileName)
Inputs: Two strings. Both must be names of existing files.
Output: Nothing.
Details: Appends the content of the file inFileName to the end of the file outFileName.

transformTextIntoListOfWords(inFileNameList,outFileName)
Inputs: A list of strings and a string. The list of strings must be a list of existing file names.
Output: Nothing.
Details: If the file outFileName does not exist, a new file is created; otherwise, the file is overwritten. outFileName will contain all words of the files listed in inFileNameList, one word per line. Some punctuation marks are removed (',' ; '.' ; '?' ; '!') and all words are written in lower-case letters.

writeListInFile(fileName,stackOfValues)
Inputs: A string and a list of strings. The string fileName is a file name.
Output: Nothing.
Details: If a file named fileName exists, it is erased. The values in the list are then written to the file referenced by fileName.

writeListInFileWordsRemovePunctation(fileName,stackOfValues)
Inputs: A string and a list of strings. The string fileName is a file name.
Output: Nothing.
Details: If a file named fileName exists, it is erased. The list is parsed and split into words, without punctuation; the words are then written (one word per line) to the file referenced by fileName.

copyFile(inputFile,outputFile)
Inputs: Two strings. The string inputFile must be the name of an existing file.
Output: Nothing.
Details: If a file named outputFile exists, it is erased. The content of the file inputFile is copied into the file outputFile.

writeDictionnary(outputFile,dictio)
Inputs: A string and a dictionary.
Output: Nothing.
Details: If a file named outputFile exists, it is erased.
The content of the dictionary is written to the file outputFile with the following structure: for each key/value pair, write "key : value\n".

GroupChunkFromMapIterator

The goal of this class is to iterate over the elements of a chunk created by the map.

__init__(self,fileName,nElem)
Inputs: A string and the number of lines in the chunk. The string fileName must reference a chunk created by the map operation.
Output: Instantiation of the class.

loadNext(self)
Input: Nothing.
Output: A line of the chunk that has not been analysed yet.

hasNext(self)
Input: Nothing.
Output: Boolean. True if there are unanalysed lines left, False otherwise.

mapChunkIterator

The goal of this class is to iterate over the elements of a chunk so that they can be analysed by the user-defined map method.

__init__(self,fileName)
Input: A string. The string fileName must reference a file created by ChunksFactory.
Output: Instantiation of mapChunkIterator.

getNext(self)
Input: Nothing.
Output: An unanalysed line of the chunk.

MapContext

Reads a chunk, stores it in memory, and can pop all elements of the chunk. Also puts all key/value pairs computed by the mapper into a list.

__init__(self,outputMapFileName,iterator)
Inputs: A string and a mapChunkIterator. The string is the name of an output file, which does not need to exist. This file is normally created by the grouper.
Output: Instantiation of the class.

putKeyValue(self,key,value)
Inputs: A key and a value. It is strongly advised to pass two strings.
Output: Nothing. The key and the value are simply appended to a list.

loadNext(self)
Input: Nothing.
Output: Updates the fields self.key and self.value with new values.
Details: This function is not supposed to be used in the map operation!

hasNext(self)
Input: Nothing.
Output: True if there are elements left in the fields self.keysList and self.valuesList.
False otherwise.
Details: This function is not supposed to be used in the map operation!

MapReduce

__init__(self,theMapper,theReducer,listOfFiles,silent=0,nThreads=1)
Inputs:
* theMapper: The user-defined mapper class. See here for more details.
* theReducer: The user-defined reducer class. See here for more details.
* listOfFiles: A list containing the names of all files to parse. Warning: for now, even if you have a single file, you have to put it in a list.
* silent (default 0): An integer. With silent = 1, execution is silent. With silent = 0, the progress of the algorithm is printed. With silent = -1, the output is also printed.
* nThreads (default 1): A strictly positive integer. Number of threads used to execute the MapReduce algorithm. Note that if the number of chunks is lower than the number of threads, then nThreads = nChunks.
Output: Instantiation of MapReduce.

execute(self)
Input: Nothing.
Output: Nothing.
Details: Launches the MapReduce algorithm. The execution is detailed in the section Standard execution of the MapReduce algorithm with PyMR.

reduceContext

__init__(self,key,iterator)
Inputs: A string key and a reduceFromGroupIterator.
Output: Instantiation of the class.

loadNextValue(self)
Input: Nothing.
Output: Pops a key and an associated value from the lists keyList and valueList in the iterator.

reduceFromGroupIterator

__init__(self,globalNodeFileName)
Input: A string.
Output: Nothing.
Details: Loads into memory all files listed in the file referenced by globalNodeFileName. Also loads the key of the node file.

getNext(self)
Input: Nothing.
Output: Returns an unanalysed value associated with the key of the node file.

Examples

Counting words
See here.

Matrix-vector multiplication
See here.

Simulation of picture similarity
See here.
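
The linked examples are not reproduced on this page. As a rough, self-contained illustration of the counting-words idea, the sketch below mimics the map, group and reduce phases in plain Python, entirely in memory. It does not use PyMR's classes or files: word_count_map, group_by_key, word_count_reduce and run_word_count are hypothetical names, and the in-memory dictionary only stands in for the value files and node files described in the standard execution section.

```python
from collections import defaultdict


def word_count_map(line):
    """Map phase: emit a (word, 1) pair for every word of an input line."""
    return [(word.lower(), 1) for word in line.split()]


def group_by_key(pairs):
    """Group phase: collect all values that share a key.

    In-memory stand-in (an assumption) for PyMR's value files and node files.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def word_count_reduce(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def run_word_count(lines):
    """Run map, group and reduce over an iterable of lines."""
    pairs = []
    for line in lines:
        pairs.extend(word_count_map(line))
    groups = group_by_key(pairs)
    return dict(word_count_reduce(k, v) for k, v in groups.items())


if __name__ == "__main__":
    print(run_word_count(["the cat", "The dog and the cat"]))
```

Keeping the three phases as separate pure functions makes it easy to test a mapper and a reducer on a few lines before running them on real chunks.
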